NiW: Converting Notebooks into Workflows to Capture Dataflow and Provenance
Authors
Abstract
Interactive notebooks are increasingly popular among scientists for exposing computational methods and sharing results. However, it is often challenging to track their dataflow, and therefore the provenance of their results. This paper presents an approach to convert notebooks into scientific workflows that explicitly capture the dataflow across software components and facilitate tracking the provenance of new results. In our approach, users first write notebooks according to a set of guidelines that we have designed, and then use an automated tool to generate workflow descriptions from the modified notebooks. Our approach is implemented in NiW (Notebooks into Workflows), and we demonstrate its use by generating workflows from third-party notebooks. The resulting workflow descriptions have explicit dataflow, which facilitates tracking the provenance of new results, comparing workflows, and mining sub-workflows. Our guidelines can also be used to improve the understandability of notebooks by making the dataflow more explicit.
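To illustrate what "explicit dataflow" between notebook cells can mean, the following is a minimal sketch and not NiW's actual implementation. It assumes each cell is available as a string of Python source, and the helper names (`names_in`, `cell_io`, `dataflow_edges`) are hypothetical. It uses Python's standard `ast` module to approximate which variables each cell reads and writes, then links each read to the most recent earlier cell that wrote that variable.

```python
import ast


def names_in(node, ctx_type):
    """Collect variable names under `node` used in the given context (Load or Store)."""
    return {
        n.id
        for n in ast.walk(node)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ctx_type)
    }


def cell_io(source):
    """Approximate the inputs and outputs of one cell.

    reads  = names loaded before the cell itself has assigned them
    writes = names assigned anywhere in the cell
    This is a coarse approximation: it ignores imports, function
    parameters, augmented assignments, and attribute or in-place mutation.
    """
    reads, writes = set(), set()
    for stmt in ast.parse(source).body:
        reads |= names_in(stmt, ast.Load) - writes
        writes |= names_in(stmt, ast.Store)
    return reads, writes


def dataflow_edges(cells):
    """Return sorted (producer_cell, consumer_cell, variable) edges."""
    last_writer = {}  # variable name -> index of the latest cell that wrote it
    edges = set()
    for idx, source in enumerate(cells):
        reads, writes = cell_io(source)
        for name in reads:
            if name in last_writer:
                edges.add((last_writer[name], idx, name))
        for name in writes:
            last_writer[name] = idx
    return sorted(edges)


# Hypothetical three-cell notebook used only for illustration.
example_cells = [
    "raw = [3, 1, 2]",
    "clean = sorted(raw)",
    "total = sum(clean)\nsize = len(raw)",
]

for producer, consumer, variable in dataflow_edges(example_cells):
    print(f"cell {producer} -> cell {consumer} via {variable}")
```

Each edge corresponds to a data dependency that a workflow description would state explicitly as an output of one component feeding an input of another. The sketch also assumes cells run top to bottom, which notebooks do not enforce.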
Similar resources
Dataflow Notebooks: Encoding and Tracking Dependencies of Cells
Computational notebooks have seen widespread adoption among scientists in many fields, and allow users to view interactive graphical results inline, to embed text and code together, to organize code into cells, and to selectively edit and re-execute cells. Because they allow quick and recordable analyses, they play an important role in documenting experiments. However, the reproducibility of no...
A Graph Model of Data and Workflow Provenance
Provenance has been studied extensively in both database and workflow management systems, so far with little convergence of definitions or models. Provenance in databases has generally been defined for relational or complex object data, by propagating fine-grained annotations or algebraic expressions from the input to the output. This kind of provenance has been found useful in other areas of c...
Atomicity and provenance support for pipelined scientific workflows
Today many significant scientific discoveries are achieved through complex and distributed scientific computations that are structured and represented as scientific workflows. Although atomicity is a well studied topic in transaction processing and business workflows, such an important capability needs to be revisited in a scientific workflow environment. Firstly, the semantics of atomicity nee...
A Dataflow-Oriented Atomicity and Provenance System for Pipelined Scientific Workflows
Scientific workflows have gained great momentum in recent years due to their critical roles in e-Science and cyberinfrastructure applications. However, some tasks of a scientific workflow might fail during execution. A domain scientist might require a region of a scientific workflow to be “atomic”. Data provenance, which determines the source data that are used to produce a data item, is also e...
Mapping the NRC Dataflow Model to the Open Provenance Model
The Open Provenance Model (OPM) has recently been proposed as an exchange framework for workflow provenance information. In this paper we show how the NRC data model for workflow repositories can be mapped to the OPM. Our mapping includes such features as complex data flow in an execution of a workflow; different workflows in the repository that call each other; and the tracking of subvalues of...